Clustering in the Presence of Noise
Clustering, which is partitioning data into groups of similar objects, has a wide range of applications. In many cases, unstructured data makes up a significant part of the input. Attempting to cluster this part of the data, which can be referred to as noise, can disturb the clustering of the remaining domain points. Despite the practical need for a framework of clustering that allows a portion of the data to remain unclustered, little research has been done so far in that direction. In this thesis, we take a step towards addressing the issue of clustering in the presence of noise in two parts. First, we develop a platform for clustering that has a cluster devoted to the "noise" points. Second, we examine the problem of "robustness" of clustering algorithms to the addition of noise.
In the first part, we develop a formal framework for clustering that has a designated noise cluster. We formalize intuitively desirable input-output properties of clustering algorithms that have a noise cluster. We review some previously known algorithms, introduce new algorithms for this setting, and examine them with respect to the introduced properties.
In the second part, we address the problem of robustness of clustering algorithms to the addition of unstructured data. We propose a simple and efficient method to turn any centroid-based clustering algorithm into a noise-robust one that has a noise cluster. We discuss several rigorous measures of robustness and prove performance guarantees for our method with respect to these measures, under the assumption that the noise-free data satisfies some niceness properties and the noise satisfies some mildness properties. We also prove that more straightforward ways of adding robustness to clustering algorithms fail to achieve the above-mentioned guarantees.
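As a rough illustration of the second part, here is a minimal sketch of one plausible way to wrap a centroid-based algorithm with a designated noise cluster: run the base algorithm, then move every point whose distance to its nearest centroid exceeds a threshold into the noise cluster. The threshold rule, the parameter tau, and the use of k-means as the base algorithm are all assumptions for illustration; the thesis's actual construction and guarantees may differ.

```python
import numpy as np

def kmeans(X, k, iters=50, seed=0):
    """Plain Lloyd's algorithm, standing in for any centroid-based method."""
    rng = np.random.default_rng(seed)
    centroids = X[rng.choice(len(X), size=k, replace=False)].copy()
    for _ in range(iters):
        labels = np.argmin(((X[:, None] - centroids[None]) ** 2).sum(-1), axis=1)
        for j in range(k):
            if np.any(labels == j):
                centroids[j] = X[labels == j].mean(axis=0)
    return labels, centroids

def with_noise_cluster(base_cluster, X, k, tau):
    """Run a centroid-based algorithm, then move every point farther than
    tau from its nearest centroid into a designated noise cluster (-1).
    Illustrative sketch only; not the thesis's exact method."""
    labels, centroids = base_cluster(X, k)
    dists = np.linalg.norm(X - centroids[labels], axis=1)
    labels = labels.copy()
    labels[dists > tau] = -1  # designated noise cluster
    return labels, centroids

# Two tight Gaussian clusters plus scattered uniform "noise" points.
rng = np.random.default_rng(1)
X = np.vstack([rng.normal(0.0, 0.1, (100, 2)),
               rng.normal(3.0, 0.1, (100, 2)),
               rng.uniform(-2.0, 5.0, (30, 2))])
labels, _ = with_noise_cluster(kmeans, X, k=2, tau=0.5)
print("points assigned to the noise cluster:", (labels == -1).sum())
```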
Efficient Learning of Linear Separators under Bounded Noise
We study the learnability of linear separators in $\mathbb{R}^d$ in the presence of
bounded (a.k.a. Massart) noise. This is a realistic generalization of the random
classification noise model, where the adversary can flip each example $x$ with
probability $\eta(x) \le \eta$. We provide the first polynomial time algorithm
that can learn linear separators to arbitrarily small excess error in this
noise model under the uniform distribution over the unit ball in $\mathbb{R}^d$, for
some constant value of $\eta$. While widely studied in the statistical learning
theory community in the context of getting faster convergence rates,
computationally efficient algorithms in this model had remained elusive. Our
work provides the first evidence that one can indeed design algorithms
achieving arbitrarily small excess error in polynomial time under this
realistic noise model and thus opens up a new and exciting line of research.
We additionally provide lower bounds showing that popular algorithms such as
hinge loss minimization and averaging cannot lead to arbitrarily small excess
error under Massart noise, even under the uniform distribution. Our work
instead makes use of a margin-based technique developed in the context of
active learning. As a result, our algorithm is also an active learning
algorithm with label complexity that is only logarithmic in the desired excess
error $\epsilon$.
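To make the noise model concrete, the following sketch samples points uniformly from the unit ball and flips each label independently with a point-dependent probability that never exceeds $\eta$, which is the defining Massart condition from the abstract. The particular flip probabilities (and the margin cutoff 0.1) are arbitrary adversarial choices for illustration only.

```python
import numpy as np

def sample_massart(n, d, w_star, eta, rng):
    """Draw n examples uniformly from the unit ball in R^d, label them by
    sign(<w*, x>), then flip each label independently with a point-dependent
    probability eta(x) <= eta (the Massart condition)."""
    g = rng.normal(size=(n, d))
    x = g / np.linalg.norm(g, axis=1, keepdims=True)  # uniform on the sphere
    x *= rng.uniform(size=(n, 1)) ** (1.0 / d)        # rescale into the ball
    y = np.sign(x @ w_star)
    # Any eta(x) <= eta is allowed; spending the full budget on low-margin
    # points is just one (arbitrary) adversarial choice.
    margin = np.abs(x @ w_star)
    flip_prob = np.where(margin < 0.1, eta, eta / 2)
    y[rng.uniform(size=n) < flip_prob] *= -1
    return x, y

rng = np.random.default_rng(0)
w_star = np.zeros(5)
w_star[0] = 1.0
X, y = sample_massart(10_000, 5, w_star, eta=0.3, rng=rng)
print("empirical noise rate:", np.mean(np.sign(X @ w_star) != y))
```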
The Sample Complexity of Multi-Distribution Learning for VC Classes
Multi-distribution learning is a natural generalization of PAC learning to
settings with multiple data distributions. There remains a significant gap
between the known upper and lower bounds for PAC-learnable classes. In
particular, though we understand the sample complexity of learning a VC
dimension $d$ class on $k$ distributions to be
$O\!\left(\epsilon^{-2}\ln(k)(d+k) + \min\{\epsilon^{-1}dk,\ \epsilon^{-4}\ln(k)\,d\}\right)$,
the best lower bound is $\Omega\!\left(\epsilon^{-2}(d + k\ln(k))\right)$. We discuss recent progress on this
problem and some hurdles that are fundamental to the use of game dynamics in
statistical learning. (11 pages; authors are ordered alphabetically. Open problem presented
at the 36th Annual Conference on Learning Theory.)
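To pin down what multi-distribution learning asks for, here is the standard objective in the agnostic setting, with assumed notation: $k$ distributions $D_1,\dots,D_k$, hypothesis class $\mathcal{H}$, and learned hypothesis $\hat h$.

```latex
% Multi-distribution (agnostic) PAC learning: the learner must compete
% with the best single hypothesis on the worst of the k distributions,
\[
  \max_{i \in [k]} \operatorname{err}_{D_i}(\hat h)
  \;\le\;
  \min_{h \in \mathcal{H}} \max_{i \in [k]} \operatorname{err}_{D_i}(h)
  \;+\; \epsilon,
\]
% with probability at least 1 - delta over the samples drawn from D_1, ..., D_k.
```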
k-Center Clustering Under Perturbation Resilience
The k-center problem is a canonical and long-studied facility location and clustering problem with many applications in both its symmetric and asymmetric forms. Both versions of the problem have tight approximation factors on worst-case instances: a 2-approximation for symmetric k-center and an O(log*(k))-approximation for the asymmetric version. Therefore, to improve on these ratios, one must go beyond the worst case.
In this work, we take this approach and provide strong positive results for both the asymmetric and symmetric k-center problems under a very natural input stability (promise) condition called alpha-perturbation resilience [Bilu and Linial, 2012], which states that the optimal solution does not change under any alpha-factor perturbation to the input distances. We show that by assuming 2-perturbation resilience, the exact solution for the asymmetric k-center problem can be found in polynomial time. To our knowledge, this is the first problem that is hard to approximate to any constant factor in the worst case, yet can be optimally solved in polynomial time under perturbation resilience for a constant value of alpha. Furthermore, we prove our result is tight by showing symmetric k-center under (2-epsilon)-perturbation resilience is hard unless NP=RP.
This is the first tight result for any problem under perturbation resilience, i.e., this is the first time the exact value of alpha for which the problem switches from being NP-hard to efficiently computable has been found.
Our results illustrate a surprising relationship between symmetric and asymmetric k-center instances under perturbation resilience. Unlike approximation ratio, for which symmetric k-center is easily solved to a factor of 2 but asymmetric k-center cannot be approximated to any constant factor, both symmetric and asymmetric k-center can be solved optimally under resilience to 2-perturbations.
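For concreteness, here is the Bilu–Linial stability condition the abstract refers to, written as a sketch with assumed notation: $d$ is the input distance function and $\mathcal{C}^*(d)$ denotes the optimal k-center clustering under $d$.

```latex
% alpha-perturbation resilience (Bilu and Linial, 2012): the optimal
% clustering must be invariant under every alpha-factor perturbation
% d' of the input distance function d.
\[
  \forall x, y:\;\; d(x,y) \le d'(x,y) \le \alpha\, d(x,y)
  \quad\Longrightarrow\quad
  \mathcal{C}^*(d') = \mathcal{C}^*(d),
\]
% where C*(d) is the (unique) optimal k-center clustering under d.
```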